Incorporating Pronunciation Variation into Extraction of Transliterated-term Pairs from Web Corpora
Abstract
This paper proposes a novel approach to automatically extracting transliterated-term pairs from Web corpora. One of the most important issues it addresses is pronunciation variation, a form of pronunciation ambiguity that seriously affects term transliteration and therefore the results produced by transliteration processes. Extracting transliterated-term pairs is a fundamental task in natural language processing, because it collects enough paired cognates to support further studies on transliteration, and mitigating pronunciation variation during that extraction is not easy. The proposed method exploits confusion matrices generated by automatic speech recognition (ASR) as a basis both for alleviating pronunciation variation and for constructing cross-linguistic syllable-and-phoneme conversions. It then improves extraction performance gradually by using cross-linguistic syllable-phoneme confusion matrices that are trained and refined progressively from the extracted term pairs. Many of the terms extracted in the experiments are new to existing lexicons. Experiments on mining information from the extracted pairs were also conducted. The experimental results show that taking pronunciation variation into account makes the extraction of paired cognates more effective.
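The progressive refinement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that both terms are already split into syllables and aligned one-to-one, it builds its counts from seed pairs rather than from ASR output, and the function names (`build_confusion`, `pair_score`, `refine`), the smoothing constant, and the acceptance threshold are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_confusion(aligned_pairs):
    """Count how often each source syllable is aligned to each target syllable."""
    counts = defaultdict(Counter)
    for src_syllables, tgt_syllables in aligned_pairs:
        for src, tgt in zip(src_syllables, tgt_syllables):
            counts[src][tgt] += 1
    return counts


def pair_score(counts, src_syllables, tgt_syllables, smoothing=0.01):
    """Average smoothed log-probability of a one-to-one syllable alignment.

    Assumption: equal-length syllable sequences; anything else is rejected.
    """
    if len(src_syllables) != len(tgt_syllables) or not src_syllables:
        return float("-inf")
    # Distinct target syllables seen so far, plus one slot for unseen ones.
    targets = {tgt for row in counts.values() for tgt in row}
    n_targets = len(targets) + 1
    total = 0.0
    for src, tgt in zip(src_syllables, tgt_syllables):
        row = counts.get(src, Counter())
        prob = (row[tgt] + smoothing) / (sum(row.values()) + smoothing * n_targets)
        total += math.log(prob)
    return total / len(src_syllables)


def refine(counts, candidates, threshold):
    """Accept candidates scoring at or above the threshold, then add them to the counts."""
    # Score every candidate against the current counts before updating anything.
    accepted = [pair for pair in candidates if pair_score(counts, *pair) >= threshold]
    for src_syllables, tgt_syllables in accepted:
        for src, tgt in zip(src_syllables, tgt_syllables):
            counts.setdefault(src, Counter())[tgt] += 1
    return accepted
```

Calling `refine` repeatedly on new batches of candidate pairs lets each accepted pair sharpen the counts used to score the next batch, which is the gradual improvement the abstract refers to.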
Similar resources
Generating Paired Transliterated-cognates Using Multiple Pronunciation Characteristics from Web corpora
A novel approach to automatically extracting paired transliterated-cognates from Web corpora is proposed in this paper. One of the most important issues addressed is that of taking multiple pronunciation characteristics into account. Terms from various languages may be pronounced very differently. Incorporating the knowledge of word origin may improve the pronunciation accuracy of terms. The accura...
Incorporating Pronunciation Variation into Different Strategies of Term Transliteration
Term transliteration addresses the problem of converting terms in one language into their phonetic equivalents in the other language via spoken form. It is especially concerned with proper nouns, such as personal names, place names and organization names. Pronunciation variation refers to pronunciation ambiguity frequently encountered in spoken language, which has a serious impact on term trans...
Constructing Transliteration Lexicons from Web Corpora
This paper proposes a novel approach to automating the construction of transliterated-term lexicons. A simple syllable alignment algorithm is used to construct confusion matrices for cross-language syllable-phoneme conversion. Each row in the confusion matrix consists of a set of syllables in the source language that are (correctly or erroneously) matched phonetically and statistically to a syl...
應用混淆音矩陣之中英文音譯詞組自動抽取 (Automatic Transliterated-term Extraction Using Confusion Matrix from Non-parallel Corpora) [In Chinese]
Multiword Named Entities Extraction from Cross-Language Text Re-use
In practice, many named entities (NEs) are multiword. Most research on mining NEs from comparable corpora focuses on single-word transliterated NEs. This work presents an approach to mining Multiword Named Entities (MWNEs) from text re-use document pairs. Text re-use, at document level, can be seen as noisy parallel or comparable text based on the level of obfusca...
Journal:
- Journal of Chinese Language and Computing
Volume 15
Publication date: 2005